Introduction

The Wine Quality dataset analyzed here contains 1,599 red wine samples. Each sample comes with a quality rating on a scale from one (bad quality) to ten (high quality). In this project we will identify which chemical properties influence the quality of red wines and how they do so.

Check duplicated values

## [1] 0

Check data dimension

## [1] 1599   13

There are 1599 observations and 13 variables in the dataset.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
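The checks above can be sketched in base R as follows. This is a minimal illustration, not necessarily the code that produced the output: the data frame `wine` is a hypothetical stand-in built from a few columns of the first six printed rows. Note that in those printed rows the first and fifth samples appear identical apart from the index column `X`, so a duplicate check that includes `X` can report 0 even when measurements repeat.

```r
# Hypothetical stand-in: a few columns of the first six rows printed in this report.
wine <- data.frame(
  X                = 1:6,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Count rows that repeat an earlier row exactly (index column included).
n_dup <- sum(duplicated(wine))

# Same check with the index column dropped, so only measurements are compared.
n_dup_no_index <- sum(duplicated(wine[, names(wine) != "X"]))

dim(wine)  # number of rows, number of columns
str(wine)  # column types and first values
```

Whether rows with identical measurements should be treated as duplicates is a judgment call; they may be separate samples of the same wine.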

As the histograms above show, density and pH are approximately normally distributed, while the remaining variables are more or less right skewed. The dependent variable, quality, has a roughly normal discrete distribution.

Most wines are rated 5 or 6. High-quality wines (rating 8) are rare, as are bad ones (ratings 3 and 4). Rating 7 has almost 200 samples. We investigate these groups further below.

##  [0,5)  [5,7) [7,10] 
##     63   1319    217
## 
##    Low Medium   High 
##     63   1319    217

The tables above group wine quality into Low [0,5), Medium [5,7), and High [7,10]. Most wines clearly fall into the Medium category. The chart above gives me confidence in the data quality: there are no outlying ratings.
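The interval labels in the output are consistent with base R's `cut()` using left-closed bins. The sketch below shows one way to get such bins; the `quality` vector is hypothetical toy data covering the observed range of scores, and the exact call used in the original analysis is an assumption.

```r
# Hypothetical quality scores covering the observed range (3 to 8).
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# right = FALSE makes each bin closed on the left: [0,5), [5,7), [7,10).
# include.lowest = TRUE closes the last bin, giving [7,10].
bins <- cut(quality, breaks = c(0, 5, 7, 10),
            right = FALSE, include.lowest = TRUE)

# Same bins with readable labels.
quality.category <- cut(quality, breaks = c(0, 5, 7, 10),
                        right = FALSE, include.lowest = TRUE,
                        labels = c("Low", "Medium", "High"))

table(bins)
table(quality.category)
```

With this binning, a score of 5 lands in Medium and a score of 7 lands in High, which matches the counts shown above.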

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality quality.category
## 1       5           Medium
## 2       5           Medium
## 3       5           Medium
## 4       6           Medium
## 5       5           Medium
## 6       5           Medium

Mean of each variable for the lowest (3), an average (5), and the highest (8) quality score.

##   quality fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1       3      8.360000        0.8845000   0.1710000       2.635000 0.12250000
## 2       5      8.254284        0.5385595   0.2582638       2.503867 0.08897271
## 3       8      8.566667        0.4233333   0.3911111       2.577778 0.06844444
##   free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
## 1            11.00000             24.90000 0.9974640 3.398000 0.5700000
## 2            16.36846             48.94693 0.9968673 3.311296 0.6472631
## 3            13.27778             33.44444 0.9952122 3.267222 0.7677778
##    alcohol
## 1  9.95500
## 2 10.25272
## 3 12.09444
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The mean quality of the red wines is 5.64 and the median is 6. I suspect outliers in free.sulfur.dioxide and total.sulfur.dioxide, because their means do not move consistently with the rating. The wine samples with the highest score have the lowest mean density, volatile acidity, and pH. Their mean residual sugar is lower than that of the lowest score (2.58 versus 2.64), although score 5 is lower still (2.50).
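A per-score table of means like the one above can be produced with base R's `aggregate()`. This is a sketch under assumptions: `wine` is a hypothetical stand-in holding three columns of the first ten rows printed by `str()` earlier, so the numbers will not match the full-data table.

```r
# Hypothetical stand-in: first ten values of three columns from the str() output.
wine <- data.frame(
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# Mean of every other column, grouped by quality score.
means_by_quality <- aggregate(. ~ quality, data = wine, FUN = mean)

# Keep only selected scores (the report shows scores 3, 5, and 8).
subset(means_by_quality, quality %in% c(5, 7))

# Five-number summary plus mean of the rating itself.
summary(wine$quality)
```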

So now the big question: which factors have an impact on the wine rating?

Attributes below increase with rating

1. Alcohol

2. Fixed acidity

3. Citric acid

4. Sulphates

Attributes below decrease with rating

1. Density

2. Volatile acidity

3. pH

4. Sugar

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The data is slightly right skewed, with a minimum of 4.6, a maximum of 15.9, a median of 7.9, and a mean of 8.32. The boxplot shows a few outliers from about 12 to 16.
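As a rough check on that outlier range, the common boxplot rule flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies the rule to the quartiles printed in the summary above. It is an illustration only: R's `boxplot()` works from hinges, which can differ slightly from these quartiles.

```r
# Values copied from the fixed acidity summary above.
min_value <- 4.60
q1        <- 7.10
q3        <- 9.20
max_value <- 15.90

iqr <- q3 - q1                   # interquartile range, about 2.1
upper_fence <- q3 + 1.5 * iqr    # about 12.35
lower_fence <- q1 - 1.5 * iqr    # about 3.95

# TRUE means at least one point lies beyond that fence.
has_upper_outliers <- max_value > upper_fence
has_lower_outliers <- min_value < lower_fence
```

Under this rule the upper fence sits near 12.35, which agrees with outliers starting at about 12, and the minimum of 4.6 stays inside the lower fence.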

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The variability between the low and high quality categories is high compared to other variables. There are a few outliers in the upper range, around 1.0 to 1.6. The median is 0.52 and the mean is 0.53.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid data is right skewed, with a minimum of 0, a maximum of 1.0 (an outlier), a median of 0.26, and a mean of 0.27.

Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The data is right skewed, with a minimum of 0.9 and a maximum of 15.5, and there are many outliers. The median is 2.2 and the mean is 2.54.

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides data is right skewed, with a minimum of 0.012, a maximum of 0.611, a median of 0.079, and a mean of 0.087.

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free sulfur dioxide data is right skewed, with a minimum of 1, a maximum of 72, a median of 14, and a mean of 15.87.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total sulfur dioxide data is right skewed, with a minimum of 6, a maximum of 289 (an outlier), a median of 38, and a mean of 46.47.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density data is approximately normal, with a minimum of 0.9901, a maximum of 1.0037, a median of 0.9968, and a mean of 0.9967.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH data is approximately normal, with a minimum of 2.74, a maximum of 4.01, a median of 3.310, and a mean of 3.311.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates data is right skewed, with a minimum of 0.33, a maximum of 2.0, a median of 0.62, and a mean of 0.658.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol data is right skewed but does not have many outliers, with a minimum of 8.4, a maximum of 14.9, a median of 10.2, and a mean of 10.42.
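One simple way to compare skew across the variables is the gap between mean and median relative to the median: a large positive gap suggests a long right tail. The sketch below uses only the means and medians printed in the summaries above; the relative-gap measure itself is an illustrative choice, not something the original analysis is known to have used.

```r
# Means and medians copied from the summary() outputs in this section.
means <- c(fixed.acidity = 8.32, volatile.acidity = 0.5278, citric.acid = 0.271,
           residual.sugar = 2.539, chlorides = 0.08747,
           free.sulfur.dioxide = 15.87, total.sulfur.dioxide = 46.47,
           density = 0.9967, pH = 3.311, sulphates = 0.6581, alcohol = 10.42)
medians <- c(7.90, 0.52, 0.26, 2.2, 0.079, 14, 38, 0.9968, 3.31, 0.62, 10.2)

# Relative gap: how far the mean sits above (or below) the median.
rel_gap <- (means - medians) / medians

# Percentages, largest gap first.
sort(round(100 * rel_gap, 1), decreasing = TRUE)
```

Variables whose gap is close to zero, such as density and pH, are the ones described above as approximately normal.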

* Less volatile.acidity in a sample is associated with higher wine quality.
* The higher the citric.acid level in a sample, the better its quality on average. Samples with a citric.acid level above 0.5 will almost never be classified as Low quality.
* The higher the sulphates level in a sample, the better its quality on average. However, the sulphates values are less spread out than the values of other variables.
* Only an alcohol level above 12 gives more certainty that the sample will be considered Medium or High quality. If the alcohol level goes below 10, a sample will most likely be considered Medium or Low quality.

The correlation matrix shows that fixed.acidity is highly positively correlated with density and citric.acid, and total.sulfur.dioxide is highly positively correlated with free.sulfur.dioxide. pH is highly negatively correlated with fixed.acidity. citric.acid is negatively correlated with volatile.acidity and pH.
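A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below is illustrative only: `wine` is a hypothetical stand-in built from five columns of the first six printed rows, so the coefficients are far too noisy to compare with the full-data values, and the 0.5 cutoff for listing strong pairs is an arbitrary assumption.

```r
# Hypothetical stand-in: five columns of the first six rows printed in this report.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  density          = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pearson correlation between every pair of columns.
cor_mat <- cor(wine, method = "pearson")
round(cor_mat, 2)

# List each pair once (upper triangle) whose |r| exceeds the cutoff.
pairs_idx <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[pairs_idx[, "row"]],
  var2 = colnames(cor_mat)[pairs_idx[, "col"]],
  r    = cor_mat[pairs_idx]
)
strong_pairs
```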

Quality vs Alcohol

## Red_wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## ------------------------------------------------------------ 
## Red_wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## ------------------------------------------------------------ 
## Red_wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## ------------------------------------------------------------ 
## Red_wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## ------------------------------------------------------------ 
## Red_wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## ------------------------------------------------------------ 
## Red_wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The trend between alcohol and quality is clear: the highest quality score has the largest median. In other words, the amount of alcohol increases with a better quality ranking. The exception is score 5, whose median (9.7) is lower than that of score 4 (10.0). Score 5 also holds most of the high-alcohol outliers (its maximum is 14.9), which is why its mean (9.9) sits above its median.
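The per-score summaries above have the layout produced by `by()`, with `Red_wine` as the data frame name shown in the output. The sketch below reproduces the idea on a hypothetical stand-in holding only the first ten rows printed by `str()` earlier, so its numbers differ from the full-data output.

```r
# Hypothetical stand-in: first ten quality and alcohol values from the str() output.
Red_wine <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# One summary() of alcohol per quality score, printed in blocks.
by(Red_wine$alcohol, Red_wine$quality, summary)

# Just the medians, as a named vector indexed by score.
alcohol_medians <- tapply(Red_wine$alcohol, Red_wine$quality, median)
alcohol_medians
```

Comparing medians rather than means keeps a handful of high-alcohol samples within one score from dominating the comparison.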

Multivariate Plots Section

Citric acid vs Fixed acidity and Volatile acidity on quality

The first scatterplot shows that good quality wines have more fixed acidity and more citric acid. The second scatterplot shows that good wines have low volatile acidity and high citric acid content.

pH vs Alcohol on Quality

In the scatterplot above we see that, for the same value of pH, the quality of wine increases as the alcohol content increases.

Sugar vs alcohol on Quality

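Alcohol shows the same relation to quality at a given sugar level as it does with the acids: for the same amount of sugar, quality tends to be higher when alcohol is higher.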

Final Plots and Summary

Every variable's distribution was explored from different perspectives: a histogram, a histogram with a log10 scale, a density chart, and a box plot. About 82% of wines have an average score (5 or 6). Alcohol, fixed acidity, citric acid, and sulphates increase with a better rating. Density, volatile acidity, pH, and sugar decrease with a better rating.

Most wine samples are rated 5 or 6 (1,319 of 1,599, about 82% of the dataset). Wines that received the highest score (8) have only a few observations, and the same holds at the low end (3 and 4). Wines with a score of 7 have about 200 observations.

The trend between alcohol and quality is clear: the highest quality score has the largest median. In other words, the amount of alcohol increases with a better quality ranking. The exception is score 5, whose median (9.7) is lower than that of score 4 (10.0), even though score 5 holds most of the high-alcohol outliers.

Reflection

The red wine dataset contains information on almost 1,600 red wine samples across 11 chemical properties plus a quality rating. About 82% of the samples received an average score (5 or 6), and the highest score (8) holds only about 1% (18 rows) of observations. The mean was not fully reliable for a few attributes, such as residual sugar and chlorides, where it differed noticeably from the median. In the future, more features (country of origin, weather conditions, wine-making process specifics, etc.) could be added to the dataset. The corrplot was crucial for understanding the interactions between the chemicals, which required further research and underlined the need to understand the basics of the domain in order to perform effective analysis. Another area that required a lot of effort was visualizing the interactions: how best to capture the not-so-obvious relationships between variables, which can be telling. Finally, this was my first time using the R language, and it was a little hard, especially coming from Java and C.